Identification of high-risk factors for recurrence of colon cancer following complete mesocolic excision: An 8-year retrospective study

Background: Colon cancer recurrence is a common adverse outcome for patients after complete mesocolic excision (CME) and greatly affects the near-term and long-term prognosis of patients. This study aimed to develop a machine learning model that can identify high-risk factors before, during, and after surgery, and predict the occurrence of postoperative colon cancer recurrence.

Methods: The study included 1187 patients with colon cancer, including 110 patients who had recurrent colon cancer. The researchers collected 44 characteristic variables, including patient demographic characteristics, basic medical history, preoperative examination information, type of surgery, and intraoperative information. Four machine learning algorithms, namely extreme gradient boosting (XGBoost), random forest (RF), support vector machine (SVM), and k-nearest neighbor algorithm (KNN), were used to construct the model. The researchers evaluated the model using the k-fold cross-validation method, ROC curve, calibration curve, decision curve analysis (DCA), and external validation.

Results: Among the four prediction models, the XGBoost algorithm performed the best. The ROC curve results showed that the AUC value of XGBoost was 0.962 in the training set and 0.952 in the validation set, indicating high prediction accuracy. The XGBoost model was stable during internal validation using the k-fold cross-validation method. The calibration curve demonstrated high predictive ability of the XGBoost model. The DCA curve showed that patients who received interventional treatment had a higher benefit rate under the XGBoost model. The external validation set’s AUC value was 0.91, indicating good extrapolation of the XGBoost prediction model.

Conclusion: The XGBoost machine learning algorithm-based prediction model for colon cancer recurrence has high prediction accuracy and clinical utility.


Introduction
Colon cancer is a gastrointestinal tumor that carries a grave prognosis. The incidence of colorectal cancer is on the rise due to changes in lifestyle and dietary habits, and there is a gradual shift in the incidence from the distal rectum to the proximal colon. According to the 2019 epidemiological survey [1], colon cancer ranks as the third most common malignancy worldwide, after lung cancer and breast cancer. To decrease the morbidity and mortality of colon cancer patients, Hohenberger proposed complete mesocolic excision (CME), which involves the removal of the tumor along with the colonic mesentery, followed by the ligation of tumor vessels at the root to ensure radical surgery. As surgical techniques continue to develop, open surgery has given way to laparoscopic and robot-assisted surgery, leading to further improvements in the prognosis of colon cancer patients [2,3]. However, despite the effectiveness of radical colon cancer surgery, clinicians have discovered that some patients have poor outcomes, such as tumor recurrence and distant metastases, which have extremely high mortality rates [4]. According to one study [5], tumor recurrence is the primary cause of postoperative death in colon cancer patients. Thus, it is essential to identify the risk factors for colon cancer recurrence and predict its occurrence.
Artificial intelligence (AI) is advancing rapidly in the medical field [6]. As a significant branch of AI, machine learning offers more stable model building and more accurate prediction, making it a popular choice among clinicians and widely used in clinical prediction and other areas [7,8]. In this study, we analyzed the clinical data of colon cancer patients and employed machine learning algorithms to develop a prediction model for colon cancer recurrence. This will enable clinicians to formulate precise individualized treatment plans promptly and improve the postoperative survival rate of patients.

Study subjects
In the current study, we used clinical data from a database of colon cancer patients treated at Wuxi People's Hospital from January 2010 to January 2018. The inclusion criteria were as follows: (1) patients who underwent open CME or laparoscopic-assisted CME; (2) the surgical team consisted of senior doctors who were able to perform CME independently; and (3) patients were diagnosed with colon cancer by imaging and tumor pathology. The exclusion criteria were as follows: (1) patients with other malignant tumors; (2) patients with serious cardiovascular or cerebrovascular disease or significant disease of the liver, kidney, or other major organs; and (3) patients with incomplete case records or who were lost to follow-up. Patients were followed for a minimum of 5 years after surgery. During follow-up, two surgeons examined them every three months, reviewing the medical history and performing physical examinations and imaging tests, including abdominal ultrasound and computed tomography (CT). The Ethics Committee of Wuxi People's Hospital approved this study (approval number KY22085). Because the study was retrospective and all patient data were anonymized, the Ethics Committee waived the requirement for informed consent in accordance with local laws and regulations.

Study design and data collection
A total of 44 preoperative variables (within 24 h of the day of surgery), intraoperative variables, and postoperative variables (within 48 h of the initial surgery) were collected. Preoperative variables included patient demographics (gender, age, smoking history, alcohol history, and body mass index), basic clinical characteristics (American Society of Anesthesiologists score, nutrition risk screening 2002 score, surgical history, disease duration, adjuvant chemotherapy history, and adjuvant radiotherapy history), basic medical history (anemia, diabetes, ileus, hypertension, hyperlipidemia, and coronary artery disease), laboratory tests (albumin, carcinoembryonic antigen, carbohydrate antigen 19-9, carbohydrate antigen 125, and carbohydrate antigen 72-4), and tumor characteristics (T-stage, N-stage, peripheral nerve invasion, vascular invasion, tumor size, tumor number, tumor configuration, and pathologic type). Intraoperative variables included surgical approach, type of surgery, duration of surgery, intraoperative bleeding, number of surgically cleared lymph nodes, and whether the operation was an emergency surgery. Postoperative variables included laboratory test indices (carcinoembryonic antigen, carbohydrate antigen 19-9, carbohydrate antigen 125, carbohydrate antigen 72-4, procalcitonin, C-reactive protein, serum amyloid A, and neutrophil to lymphocyte ratio) and tumor characteristics (tumor recurrence).

Development and evaluation of predictive models for machine learning algorithms
The statistical software programs SPSS and R were used to develop and assess the clinical prediction models. (1) Univariate and multivariate regression analyses. Categorical variables were compared between the two groups using the chi-square test, and the t-test was used for continuous variables that followed a normal distribution. For continuous variables that were not normally distributed, the rank sum test was used. A p-value of less than 0.05 was considered statistically significant. Logistic regression analysis was performed on variables that were significant in the univariate analysis to identify independent factors influencing postoperative colon cancer recurrence. Four predictive models, namely extreme gradient boosting (XGBoost), random forest (RF), support vector machine (SVM), and k-nearest neighbor algorithm (KNN), were used to score and rank the importance of all the variables. Variables that appeared in the top ten rankings in all four models and were also significant in both univariate and multivariate regression analyses were chosen. (2) Development and evaluation of prediction models. Colon cancer patients diagnosed between January 2010 and December 2016 were selected as the internal validation set, while patients diagnosed between January 2017 and January 2018 were chosen as the external validation set. The internal validation set was divided randomly into a training set (70%) and a test set (30%). The variables selected on the basis of significance in the univariate and multivariate regression analyses and a top-ten ranking in all four machine learning models (SVM, RF, XGBoost, and KNN) were incorporated into the four prediction models. Three aspects were used to evaluate the models: discrimination, calibration, and clinical usefulness. The best model was selected for prediction analysis.
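The temporal and random splits described above can be sketched as follows. This is a minimal illustration in Python, not the authors' SPSS/R workflow: the `diagnosed` field, the `split_cohort` helper, the random seed, and the toy cohort are all assumptions introduced for the example.

```python
import random
from datetime import date


def split_cohort(patients, cutoff=date(2017, 1, 1), train_fraction=0.7, seed=0):
    """Split patients into training, test, and external validation sets.

    `patients` is a list of dicts with a hypothetical "diagnosed" date field.
    Patients diagnosed before `cutoff` form the internal set, which is split
    at random; patients diagnosed on or after `cutoff` form the external set.
    """
    internal = [p for p in patients if p["diagnosed"] < cutoff]
    external = [p for p in patients if p["diagnosed"] >= cutoff]

    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = internal[:]         # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:], external


# Toy cohort (synthetic): 80 patients diagnosed between 2010 and 2017.
patients = [{"id": i, "diagnosed": date(2010 + i % 8, 1 + i % 12, 1)} for i in range(80)]
train, test, external = split_cohort(patients)
print(len(train), len(test), len(external))
```

Splitting by diagnosis date before any random shuffling keeps the later patients entirely out of model development, which is what allows them to serve as an external check.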
Receiver operating characteristic (ROC) curves were plotted to obtain the area under the curve (AUC) and assess the predictive performance of the models. Calibration curves were used to assess the agreement between predicted and observed outcomes, while decision curve analysis (DCA) was used to assess the net benefit to patients of intervention guided by the model. Internal validation was completed using the k-fold cross-validation method. (3) External validation of the optimal model was conducted using an independent test set. The ROC curve was plotted to evaluate the predictive accuracy and generalizability of the model. (4) Model interpretation. SHAP analysis uses the Shapley value to quantify the contribution of each feature to the prediction for a given sample. Based on the Shapley values, a SHAP summary plot was generated to rank the importance of risk factors, and SHAP force plots were constructed to analyze and interpret the predictions for individual samples.
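The three evaluation aspects named above (discrimination, calibration, and clinical usefulness) each have a simple numeric summary. The sketch below computes the AUC, the Brier score, and the decision-curve net benefit in plain Python using their standard textbook definitions; the function names and the eight toy predictions are assumptions for illustration and are not taken from the study data.

```python
def auc(y_true, y_score):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def net_benefit(y_true, y_prob, threshold):
    """Net benefit at one probability threshold, as used in decision curve
    analysis: TP/n - FP/n * threshold / (1 - threshold)."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


# Toy predictions (synthetic): 1 = recurrence, 0 = no recurrence.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]

print(auc(y_true, y_prob))
print(brier_score(y_true, y_prob))
print(net_benefit(y_true, y_prob, threshold=0.5))
```

A decision curve is obtained by evaluating `net_benefit` over a range of thresholds and comparing the result with the "treat all" and "treat none" strategies.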

Basic clinical information of the patient
A total of 1187 patients were included in the study, including 110 (9.27%) patients with recurrent colon cancer (Fig 1). The original data presented in the study are included in the S1 Table.

Analysis of risk factors for postoperative recurrence of colon cancer
The results of both univariate and multivariate analyses indicated that T-stage, N-stage, liver metastases, vascular invasion, tumor number, tumor size, preoperative carcinoembryonic antigen (CEA) level, postoperative CEA level, preoperative carbohydrate antigen 19-9 (CA19-9) level, postoperative CA19-9 level, albumin (ALB), and emergency surgery had significant independent effects on colon cancer recurrence (P<0.05) (Table 1). The XGBoost, RF, SVM, and KNN models were used to identify the risk factors affecting the recurrence of colon cancer, and the top variables selected were N-stage, liver metastases, tumor number, tumor size, postoperative carbohydrate antigen 125 (CA125) level, C-reactive protein (CRP) level, neutrophil to lymphocyte ratio (NLR), and postoperative CEA level (Fig 2A-2D). Based on these results, the variables used to construct the predictive model in this study were N-stage, liver metastases, tumor number, tumor size, and postoperative CEA level.
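The final variable set is the intersection of the two lists reported above. The short Python sketch below reproduces that step using only the variable names given in this section; the set names are illustrative labels, not part of the authors' code.

```python
# Variables significant in both univariate and multivariate analyses (Table 1).
significant_in_regression = {
    "T-stage", "N-stage", "liver metastases", "vascular invasion",
    "tumor number", "tumor size", "preoperative CEA", "postoperative CEA",
    "preoperative CA19-9", "postoperative CA19-9", "albumin", "emergency surgery",
}

# Top-ranked variables shared by the XGBoost, RF, SVM, and KNN models (Fig 2A-2D).
top_ranked_by_models = {
    "N-stage", "liver metastases", "tumor number", "tumor size",
    "postoperative CA125", "CRP", "NLR", "postoperative CEA",
}

# Keep only variables supported by both the regression and the model rankings.
selected = significant_in_regression & top_ranked_by_models
print(sorted(selected))
```

Requiring agreement between the regression analysis and all four model rankings is a conservative filter: a variable such as postoperative CA125 is ranked highly by the models but is dropped because it was not significant in the regression analyses.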

Model building and evaluation
The results of the ROC curve analysis showed that the XGBoost model had the highest AUC value in both the training set (0.962) and the validation set (0.952), indicating good discrimination ability (Table 2). The calibration curve analysis showed that the predicted probabilities from the XGBoost model agreed well with the observed probabilities. The Brier score of XGBoost was the lowest among the four models, indicating good accuracy of the predicted probabilities. The DCA curves showed that all four models had a net clinical benefit, with XGBoost having the highest net benefit at most probability thresholds (Fig 3A-3D). The k-fold cross-validation method was used to evaluate the generalization ability of the four models. A test set of 264 cases (30.10%) was randomly selected from the overall dataset, and the remaining samples were used as the training set for 10-fold cross-validation. The XGBoost model had an AUC of 0.9358±0.0391 for the validation set and an AUC of 0.9158 for the test set, with an accuracy of 0.8939 (Fig 4A-4C). The RF model had an AUC of 0.9177±0.0709 for the validation set and an AUC of 0.8734 for the test set, with an accuracy of 0.8939. The SVM model had an AUC of 0.8451±0.1078 for the validation set and an AUC of 0.8183 for the test set, with an accuracy of 0.9583. The KNN model had an AUC of 0.8801±0.0661 for the validation set and an AUC of 0.8715 for the test set, with an accuracy of 0.9242. After comprehensive comparison, the XGBoost algorithm was chosen to construct the predictive model in this study.

PLOS ONE
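The cross-validated AUC values in this section are reported as a mean and standard deviation across folds. The following Python sketch shows how such a summary can be produced with 10-fold cross-validation. It is an illustration under stated assumptions, not the study's pipeline: the data are synthetic with a single made-up feature, the classifier is a bare-bones k-nearest-neighbour scorer written for the example, the folds are stratified by outcome (the paper does not say whether stratification was used), and the separate held-out test step is omitted.

```python
import random
import statistics


def auc(y_true, y_score):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, seed=0):
    """Assign indices to k folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def knn_score(x, train_x, train_y, k=5):
    """Fraction of positives among the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


# Synthetic data: 30 positives and 70 negatives with one continuous feature.
rng = random.Random(42)
labels = [1] * 30 + [0] * 70
features = [rng.gauss(2.0 if y else 0.0, 1.0) for y in labels]

folds = stratified_folds(labels, k=10)
fold_aucs = []
for val_idx in folds:
    held_out = set(val_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    train_x = [features[i] for i in train_idx]
    train_y = [labels[i] for i in train_idx]
    scores = [knn_score(features[i], train_x, train_y) for i in val_idx]
    fold_aucs.append(auc([labels[i] for i in val_idx], scores))

mean_auc = statistics.mean(fold_aucs)
sd_auc = statistics.stdev(fold_aucs)
print(f"AUC = {mean_auc:.4f} ± {sd_auc:.4f}")
```

A small standard deviation across folds, as reported for XGBoost, suggests that performance does not depend strongly on which patients happen to fall in the validation fold.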

Model external validation
The ROC curve for the external validation set showed an AUC of 0.91, indicating that the prediction model discriminates well between patients with and without recurrence in data that were not used for model development (Fig 4D).

Model explanation
The SHAP summary plot revealed that the risk factors for the recurrence of colon cancer were ranked in the following order: tumor size, N-stage, postoperative CEA level, tumor number, and liver metastases (Fig 5). The SHAP force plots depict the predictive analysis of the study model for four patients who had recurrent colon cancer. For patient I, the model predicted a 0.076 probability of recurrence, with tumor size ≥ 5 cm and lymphatic metastasis increasing the predicted probability. For patient II, the model predicted a 0.007 probability of recurrence, with lymphatic metastasis increasing the predicted probability. For patient III, the model predicted a 0.365 probability of recurrence, with tumor size ≥ 5 cm and liver metastasis increasing the predicted probability. For patient IV, the model predicted a 0.747 probability of recurrence, with tumor size ≥ 5 cm, lymphatic metastasis, and postoperative CEA ≥ 5 ng/ml increasing the predicted probability (Fig 6A-6D).
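A force plot decomposes one patient's prediction into a baseline value plus one signed contribution per feature. The Python sketch below computes exact Shapley values for a deliberately tiny, made-up risk function to show that additive property. Everything in it is hypothetical: the `toy_risk` coefficients, the binary feature names, and the all-zero baseline are assumptions for illustration, and the SHAP analysis in the study averages over a background dataset rather than using a single reference patient.

```python
from itertools import permutations


def toy_risk(f):
    """Hypothetical risk score with one interaction term (not the study model)."""
    return (0.05
            + 0.20 * f["large_tumor"]
            + 0.15 * f["lymph_node_metastasis"]
            + 0.10 * f["elevated_postop_cea"]
            + 0.20 * f["large_tumor"] * f["lymph_node_metastasis"])


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution of each feature
    over every order in which features can be switched from the baseline
    value to the patient's value."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = instance[name]
            updated = model(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


patient = {"large_tumor": 1, "lymph_node_metastasis": 1, "elevated_postop_cea": 1}
reference = {"large_tumor": 0, "lymph_node_metastasis": 0, "elevated_postop_cea": 0}

phi = shapley_values(toy_risk, patient, reference)
base_value = toy_risk(reference)
print(phi)
print(base_value + sum(phi.values()), toy_risk(patient))
```

The interaction term is shared equally between the two features involved, and the baseline plus the sum of the contributions recovers the model output for the patient, which is the property that force plots visualize.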

Discussion
This study evaluated risk prediction models constructed with four machine learning algorithms; among them, the XGBoost algorithm showed the best accuracy and efficiency. Unlike the RF algorithm, the XGBoost algorithm includes regularization, which helps limit overfitting of the model [9]. Compared with the SVM and KNN algorithms, the XGBoost algorithm is better suited to large sample sizes and many feature variables, which reduces the computation and training time required [10]. Therefore, the XGBoost algorithm was chosen to construct a model to predict the recurrence of colon cancer after surgery. The prediction model serves at least two purposes: to clarify the risk factors for tumor recurrence, and to prompt clinicians to intervene in high-risk patients in a timely manner to reduce the risk of tumor recurrence. In this study, SHAP analysis was used to interpret the model, and the results identified CEA ≥ 5 ng/ml, tumor size, lymphatic metastasis, liver metastasis, and multiple tumors as risk factors for the recurrence of colon cancer after radical colon cancer surgery. The larger a tumor, the deeper it infiltrates the surrounding tissues, thereby increasing the probability of lymphatic and distant metastasis and making complete surgical resection more difficult. The National Comprehensive Cancer Network (NCCN) and the American Joint Committee on Cancer (AJCC) have laid out explicit guidelines regarding the radicality of colon cancer surgery, emphasizing that the procedure should excise a sufficiently extensive section of bowel to ensure negative surgical margins [11,12]. However, tumor invasion may be too deep for the surgeon to determine the extent of resection precisely with the naked eye.
Moreover, performing intraoperative rapid pathological examination to guarantee negative margins is often challenging, resulting in an augmented risk of postoperative tumor remnants. Additionally, larger tumors tend to divide at a quicker pace, generating more tumor vessels. Tomisaki's analysis of 175 colon cancer patients demonstrated a strong correlation between metastatic recurrence of colon cancer and tumor microvessel density (MVD). The higher the MVD, the more likely tumor cells are to enter the circulatory system, exacerbating the risk of recurrence [13]. Furthermore, Park found that tumor cells originating from larger tumors are more prone to shedding into the abdominal and pelvic cavities, as well as the vascular tissue, further increasing the probability of tumor recurrence post-surgery [14]. Tumor recurrence is comparably prevalent among patients diagnosed with multiple colon cancers. Li [15] assessed this supposition through the implementation of two distinct mouse models. Specifically, mice within the experimental group underwent conventional tumor resection, while mice within the control group underwent sham surgery. Remarkable distinctions were identified in the size of tumor growth and the extent of recurrence within the experimental group compared to the control group.
The findings of the present investigation suggest that postoperative CEA levels may serve as an indicator of the likelihood of colon cancer recurrence in patients. Gold previously regarded CEA as an acidic glycoprotein produced by normal human mucosal cells, which lacked specificity for diagnosing colorectal cancer [16,17]. However, in recent years, medical testing techniques have advanced and clinicians have come to acknowledge the significance of CEA. An earlier prospective study analyzed the correlation between serum tumor marker concentrations in colon cancer patients and clinical factors, revealing a positive association between elevated CEA levels and colon cancer development [18]. Subsequently, Tsuyoshi et al. reported that most patients experienced a return of their serum CEA concentrations to normal levels three months following radical colon cancer surgery. In contrast, a subset of patients whose postoperative CEA levels did not decrease from preoperative levels had a high risk of rapid tumor recurrence. The elevated CEA levels following surgery can serve as a marker for colon cancer recurrence, which is consistent with the outcomes of the present study [19]. In recent times, some medical practitioners have employed a combination of preoperative CEA, CA19-9, CK-1, and MUC-1 to detect colon cancer in patients diagnosed with the disease. This approach has shown to enhance the sensitivity and specificity of tumor monitoring, as well as assess tumor stage and metastasis more accurately, and is particularly useful in predicting the likelihood of postoperative recurrence in patients [20].
Given that most of the blood flow from the gastrointestinal tract returns via the portal system, the liver is among the most frequent sites of metastasis in advanced gastrointestinal tumors, with approximately 20% of colon cancer patients developing liver metastases during the course of their disease [21,22]. The optimal treatment approach for colon cancer patients with multiple liver metastases involves resection of liver metastases in conjunction with radical colon cancer surgery. However, residual tumor remains after surgery in up to 40% of cases, and complete eradication of the tumor is difficult. The present study findings indicate that patients with preoperative liver metastases are at an increased risk of postoperative tumor recurrence. Metastatic colon cancer cells in the liver are known to exist in a dormant state. However, any alteration in the immune system or the organ microenvironment can activate these cells, leading to postoperative recurrence [23]. Liver cells are considered to be stable cells with a high regenerative capacity, but trauma or surgical resection can cause these cells to transition from a stable to a dividing state. Several studies [24-26] have suggested that proliferating liver cells can promote the growth of tumor cells. Residual tumor cells in the liver after surgery may also activate the hepatic epidermal growth factor receptor (EGFR), promoting tumor recurrence. Additionally, after hepatectomy, endothelial cell growth factor (ECGF) is upregulated due to the remodeling of liver vasculature, which can stimulate tumor vascular growth [25,26]. Hepatocyte growth factor (HGF) is the most potent mitogen that stimulates liver cell proliferation. After hepatectomy, overexpression of HGF also activates dormant residual cancer cells [27-29]. Notably, metastatic liver carcinomas can express matrix metalloproteinase-2 (MMP-2), which is closely associated with tumor recurrence and metastasis.
First, MMP-2 can decompose basement membrane glycoproteins and extracellular matrix protein components, thereby promoting tumor invasion and metastasis; it can also encourage its own secretion by positively regulating MMP-1. Second, MMP-2 promotes tumor vascular proliferation, thereby increasing the risk of tumor recurrence [30,31].
SHAP analysis revealed that lymphatic metastasis is a major risk factor for postoperative recurrence in patients with colon cancer. Two mechanisms may explain this finding. Firstly, there is a dense lymph node network in the colonic mesentery around the tumor, which complicates radical surgery once the tumor has invaded and limits complete tumor removal. Secondly, tumors frequently metastasize to retroperitoneal organs via the lymph nodes. Clinical manifestations in these patients are often subtle, and diagnosis by imaging is challenging. These factors contribute to the elevated risk of postoperative tumor recurrence [32,33]. David's study [32] similarly found a close correlation between lymph node metastasis and tumor recurrence, and Radespiel's study [34] found that a higher number of lymph node metastases leads to a greater chance of tumor recurrence and a higher postoperative mortality rate. Therefore, it is essential for the surgeon to thoroughly clear the pertinent lymph nodes during radical colon cancer surgery, avoid squeezing the tumor, and prevent tumor dissemination into the abdominal cavity [35].
The present study also examined factors such as surgical approach to evaluate tumor recurrence and found no significant difference between the two approaches, which remains somewhat controversial in clinical practice. Aasmund [36] concluded that laparoscopic surgery adheres to the concept of minimally invasive surgery, which has minimal impact on the patient's immune system and reduces the likelihood of tumor recurrence in postoperative patients. Conversely, Mirow [37] suggested that the trocar used in laparoscopic surgery may cause tumor implantation. Therefore, clinicians should opt for minimally invasive surgical approaches when treating patients with colon cancer to reduce patient trauma. Moreover, operators should strictly adhere to the tumor-free principle and avoid contact with the tumor when inserting the trocar to minimize the risk of tumor dissemination.
In recent years, numerous prediction models have been constructed to predict colon cancer recurrence with varying degrees of success [38-40]. However, many of these models were built with parametric regression, which assumes linear relationships between clinical characteristic variables. Patient prognosis cannot be accurately predicted using regression models alone because of the complex interrelationships between clinical variables. To address this, the present study used the XGBoost machine learning method to construct a prediction model for tumor recurrence after radical colon cancer surgery that can meet the practical needs of clinical decision making. The findings of the model suggest that clinicians should use a combination of CEA, CA19-9, and other tumor markers for timely follow-up review of postoperative patients. For patients presenting with symptoms such as low back pain or intestinal obstruction, CT and other imaging examinations can also be used to determine whether retroperitoneal metastasis is present. Research conducted by Shibata [32] shows that the survival rate of patients with recurrent colon cancer is low when only radiotherapy and chemotherapy are administered. Repeat surgical treatment has demonstrated significantly better efficacy than nonsurgical treatment, and surgery remains the primary treatment for patients with recurrent colon cancer [32,41]. For patients with large tumors or multiple tumors that cannot be completely resected, chemotherapy should be administered early to reduce tumor size prior to radical resection.
The present study has evaluated the model thoroughly in terms of discrimination, calibration, and clinical utility; however, there are several limitations that should be noted. Firstly, imaging and other related factors were not considered in the study, which might affect the accuracy of the prediction model. The prognosis of tumor patients greatly hinges upon Lynch syndrome, MMR gene, MSI-H, and genetic mutations. However, regrettably, this study lacked the requisite data to conduct a comprehensive predictive analysis in this regard. Nevertheless, we aim to ameliorate this research in the future by gathering pertinent data, thereby offering more advantageous insights for the prognosis of colorectal cancer patients. Additionally, the study was limited to a single center and was conducted retrospectively, which could lead to selection bias, distribution bias, and retrospective bias. Therefore, in future studies, it is recommended to include multicenter prospective studies to increase the reliability and generalizability of the results.

Conclusion
A model based on the XGBoost machine learning algorithm was developed in this study to predict the likelihood of tumor recurrence in colon cancer patients following surgery. The model showed robust predictive accuracy and clinical utility, giving surgeons a useful tool for the timely identification of high-risk patients. Postoperative tumor recurrence remains a significant obstacle in the management of patients after CME, and the model highlights postoperative CEA, tumor size, lymphatic and liver metastasis, and number of tumors as factors closely associated with the risk of recurrence.